# Statement Annotation Pipeline

This folder contains the complete pipeline for processing and annotating political statements about infrastructure and clean energy investments using automated GPT-based coding.

## Overview

The annotation pipeline transforms raw statement data into structured annotations that identify:
- Whether statements contain credit claims
- Who receives credit (Biden, senators, representatives, governors, local officials, parties)
- Which policies receive credit (Inflation Reduction Act [IRA], Bipartisan Infrastructure Law [BIL])

## Pipeline Steps

### 1. Data Processing (`process.R`)

**Purpose**: Clean and prepare raw statement data for annotation

**Input**: `data/input/credit/Statements_20250813.xlsx`

**Key Operations**:
- Imports raw Excel file containing statements from multiple political actors
- Cleans column names and formats dates
- Validates date consistency (flags statements dated before the IRA was signed into law on 2022-08-16)
- Removes bracket annotations and converts to long format
- Creates statement metadata including speaker names, roles, and release types
- Preprocesses statement text (removes URLs, brackets, standardizes punctuation)
- Assigns unique statement IDs
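
The text preprocessing step above can be sketched in base R as follows. This is only an illustration: `clean_statement_text()` is a hypothetical helper, and the exact rules in `process.R` may differ.

```r
# Hypothetical helper (an assumption, not the function used in process.R)
clean_statement_text <- function(x) {
  x <- gsub("https?://\\S+", "", x, perl = TRUE)   # drop URLs
  x <- gsub("\\[[^\\]]*\\]", "", x, perl = TRUE)   # drop [bracketed] annotations
  x <- gsub("[\u201C\u201D]", "\"", x)             # curly double quotes -> straight
  x <- gsub("[\u2018\u2019]", "'", x)              # curly single quotes -> straight
  x <- gsub("\\s+", " ", x, perl = TRUE)           # collapse repeated whitespace
  trimws(x)
}

raw <- "We broke ground today [applause] thanks to the IRA. See https://example.com/release"
cleaned <- clean_statement_text(raw)
print(cleaned)
```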

**Output**: `data/inter/statements_processed.csv`

**Key Variables Created**:
- `statement_id`: Unique identifier for each statement
- `statement`: Cleaned statement text
- `speaker_name`: Name of the person/entity making the statement
- `speaker_role`: Role (Company, Governor, U.S. Senator, U.S. Representative, President)
- `release_type`: Type of press release or statement
- Various date columns for statement timing

### 2. GPT Annotation (`annotate.R`)

**Purpose**: Automated annotation of statements using OpenAI GPT models

**Input**: `data/inter/statements_processed.csv`

**Annotation Process**:

#### Stage 1: Binary Screening
- Uses `gpt-3.5-turbo-0125` for fast binary classification
- Screens each statement for mentions of:
  - Inflation Reduction Act (IRA)
  - Bipartisan Infrastructure Law (BIL)
- Determines whether either law directly funded, enabled, or influenced the specific project

#### Stage 2: Detailed Coding
- Uses `gpt-4o-mini` for comprehensive annotation
- Follows detailed codebook (`../codebook.md`) for credit attribution
- Codes 10 binary variables per statement:
  - `gives_credit`: Whether statement contains any credit claim
  - `credit_biden`: Credit to President Biden
  - `credit_senate`: Credit to U.S. senators
  - `credit_us_rep`: Credit to U.S. representatives
  - `credit_governor`: Credit to governors
  - `credit_local`: Credit to local officials
  - `credit_dem`: Credit to Democratic Party
  - `credit_gop`: Credit to Republican Party
  - `credit_ira`: Credit to Inflation Reduction Act
  - `credit_bil`: Credit to Bipartisan Infrastructure Law
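
As a rough illustration of how a model response could be mapped onto these 10 variables, the sketch below assumes the model returns a JSON object keyed by the variable names. That response format and the helper `parse_annotation()` are assumptions; the actual parsing in `annotate.R` may differ.

```r
library(jsonlite)

# Variable names as listed above
credit_vars <- c("gives_credit", "credit_biden", "credit_senate", "credit_us_rep",
                 "credit_governor", "credit_local", "credit_dem", "credit_gop",
                 "credit_ira", "credit_bil")

# Hypothetical helper: returns a named integer vector, NA where a value
# is missing, malformed, or not 0/1
parse_annotation <- function(json_text, vars = credit_vars) {
  out <- setNames(rep(NA_integer_, length(vars)), vars)
  parsed <- tryCatch(fromJSON(json_text), error = function(e) NULL)
  if (!is.list(parsed)) return(out)          # malformed or non-object response
  for (v in intersect(vars, names(parsed))) {
    value <- suppressWarnings(as.integer(parsed[[v]][1]))
    if (isTRUE(value %in% c(0L, 1L))) out[[v]] <- value
  }
  out
}

res <- parse_annotation('{"gives_credit": 1, "credit_ira": 1, "credit_biden": 0}')
print(res)
```

Variables the response omits stay `NA` rather than defaulting to 0, so gaps remain visible downstream.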

**Key Features**:
- **Resume capability**: Can resume from interruptions using cached results
- **Error handling**: Retries failed API calls with exponential backoff
- **Progress tracking**: Real-time progress bar and logging
- **Caching system**: Saves intermediate and final results with date stamps
- **Post-processing filters**: Validates that credited entities are explicitly mentioned
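
The post-processing filter can be pictured as resetting a credit code to 0 when the credited entity never appears in the statement text. The patterns and the helper `apply_mention_filter()` below are hypothetical; the rules actually used by the pipeline may be broader or stricter.

```r
# Hypothetical patterns (assumptions for illustration only)
mention_patterns <- c(
  credit_biden = "Biden",
  credit_ira   = "Inflation Reduction Act|\\bIRA\\b",
  credit_bil   = "Bipartisan Infrastructure Law|\\bBIL\\b"
)

# Set a credit variable to 0 when it is coded 1 but the entity is not mentioned
apply_mention_filter <- function(df, patterns = mention_patterns) {
  for (v in names(patterns)) {
    mentioned <- grepl(patterns[[v]], df$statement, perl = TRUE)
    df[[v]] <- ifelse(df[[v]] == 1 & !mentioned, 0L, df[[v]])
  }
  df
}

coded <- data.frame(
  statement    = c("Thanks to the IRA, we are hiring.", "Our state secured this plant."),
  credit_biden = c(1L, 0L),
  credit_ira   = c(1L, 1L),
  credit_bil   = c(0L, 0L)
)
filtered <- apply_mention_filter(coded)
print(filtered)
```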

**Outputs**:
- `data/inter/annotated_statements.csv`: Main output for downstream analysis
- `data/api_cache/statement_annotations/{YYYYMMDD}/annotated_statements.csv`: Archived copy
- `data/api_cache/statement_annotations/{YYYYMMDD}/intermediate_results.rds`: Resume cache

### 3. Quality Analysis (`analyze_quality.R`)

**Purpose**: Evaluate annotation quality through human-AI agreement analysis

**Input**: 
- `data/inter/annotated_statements.csv`: GPT annotations
- Quality control files from human coders

**Analysis**:
- Compares GPT annotations with human coder annotations on subset of statements
- Calculates agreement metrics (percent agreement, kappa statistics)
- Identifies systematic disagreements between human and AI coding
- Generates quality control reports
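
For reference, the two agreement metrics named above can be computed in base R as sketched here. The example uses Cohen's kappa for two coders, which is an assumption; `analyze_quality.R` may use a different kappa variant or a dedicated package. The vectors are made-up illustration data, not results from this project.

```r
# Share of items on which two coders give the same code
percent_agreement <- function(a, b) mean(a == b)

# Cohen's kappa: agreement corrected for chance agreement
# (returns NaN when expected agreement is exactly 1)
cohens_kappa <- function(a, b) {
  lev <- union(a, b)
  tab <- table(factor(a, levels = lev), factor(b, levels = lev))
  p_obs <- sum(diag(tab)) / sum(tab)
  p_exp <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
  (p_obs - p_exp) / (1 - p_exp)
}

# Toy codes for one binary variable (illustration only)
human <- c(1, 1, 0, 0, 1, 0, 1, 0)
gpt   <- c(1, 0, 0, 0, 1, 0, 1, 1)

percent_agreement(human, gpt)  # 0.75
cohens_kappa(human, gpt)       # 0.5
```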

**Output**: Quality analysis results and agreement statistics

## File Structure

```
annotation/
├── README.md                    # This documentation
├── process.R                    # Data cleaning and preprocessing
├── annotate.R                   # GPT-based annotation
└── analyze_quality.R            # Quality control analysis
```

## Dependencies

### R Packages
```r
# Data manipulation
library(tidyverse)
library(tidylog)
library(here)
library(readxl)
library(janitor)
library(lubridate)

# API and JSON
library(jsonlite)
library(httr)
library(openai)
library(progress)
```

### External Files
- `../codebook.md`: Detailed annotation guidelines
- `code/utils/gpt_utils.R`: Utility functions for GPT operations

## Configuration

### API Setup
- Requires OpenAI API key in environment variables
- Uses `check_api_key()` utility function for validation
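
A quick way to confirm the key is visible to R before starting a run is sketched below. `has_api_key()` is a hypothetical stand-in; the real `check_api_key()` lives in `code/utils/gpt_utils.R` and may behave differently.

```r
# Hypothetical stand-in for check_api_key(): TRUE if the variable is set and non-empty
has_api_key <- function(var = "OPENAI_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_api_key()) {
  message("OPENAI_API_KEY is not set; add it to .Renviron or call Sys.setenv() before annotating.")
}
```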

### Key Parameters
- **Models**: 
  - Stage 1: `gpt-3.5-turbo-0125` (fast binary classification)
  - Stage 2: `gpt-4o-mini` (detailed coding)
- **Rate limiting**: 1-second delay between API calls (configurable)
- **Caching**: Saves every 10 iterations (configurable)
- **Temperature**: 0 (minimizes sampling randomness; outputs are near-deterministic but not guaranteed identical across runs)

## Usage

### Run Complete Pipeline
```r
# 1. Process raw data
source(here("R", "annotation", "process.R"))

# 2. Run GPT annotation
source(here("R", "annotation", "annotate.R"))

# 3. Analyze quality (optional)
source(here("R", "annotation", "analyze_quality.R"))
```

### Resume Interrupted Annotation
The annotation script automatically resumes from the last saved checkpoint:
```r
# Will resume from where it left off
source(here("R", "annotation", "annotate.R"))
```

### Run Robustness Checks with Alternative Codebooks and Probes
You can test annotation consistency using alternative codebooks and/or probe templates:

**Method 1: Modify main script**
```r
# Option A: alternative codebook and probe template.
# Set these values in annotate.R:
ROBUSTNESS_CHECK <- "strict_all"
CUSTOM_CODEBOOK <- here("R", "annotation", "codebook_alt_strict.md")
CUSTOM_PROBE_TEMPLATE <- here("R", "annotation", "probe_template_strict.md")
# Then run the script:
source(here("R", "annotation", "annotate.R"))

# Option B: test only probe sensitivity.
# Set these values in annotate.R instead, then run the script as above:
ROBUSTNESS_CHECK <- "strict_probe_only"
CUSTOM_PROBE_TEMPLATE <- here("R", "annotation", "probe_template_strict.md")
```

**Method 2: Use dedicated robustness script**
```r
# Edit run_robustness_check.R and run
source(here("R", "annotation", "run_robustness_check.R"))
```

**Available Templates:**
- Codebooks: `codebook.md` (default), `codebook_alt.md`, `codebook_alt_strict.md`
- Probe templates: `probe_template.md` (default), `probe_template_strict.md`, `probe_template_lenient.md`

Robustness check outputs are saved separately:
- `data/inter/annotated_statements_[ID].csv`
- `data/api_cache/statement_annotations/robustness/[ID]/[DATE]/`

### Load Most Recent Results
```r
# Load primary annotations
annotations <- load_most_recent_annotations()

# Load robustness check results
annotations_robust <- load_most_recent_annotations(robustness_check = "strict_codebook")
```
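
One way a "most recent" lookup over date-stamped cache folders could work is sketched below. `find_latest_cache_dir()` is a hypothetical helper written for illustration; it is not the implementation of `load_most_recent_annotations()`.

```r
# Hypothetical helper: pick the newest YYYYMMDD subfolder under a cache root
find_latest_cache_dir <- function(cache_root) {
  dirs <- list.dirs(cache_root, full.names = FALSE, recursive = FALSE)
  stamps <- dirs[grepl("^[0-9]{8}$", dirs)]      # ignore non-date folders
  if (length(stamps) == 0) return(NA_character_)
  file.path(cache_root, max(stamps))             # YYYYMMDD sorts chronologically as text
}

# Demo in a temporary directory
root <- file.path(tempdir(), "statement_annotations_demo")
dir.create(file.path(root, "20250101"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "20250813"), showWarnings = FALSE)
dir.create(file.path(root, "robustness"), showWarnings = FALSE)

latest <- find_latest_cache_dir(root)
print(basename(latest))
```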

## Output Data Structure

The final annotated dataset (`annotated_statements.csv`) contains:

- **Original statement data**: All variables from processed statements
- **Binary annotation variables**: 10 credit attribution variables (0/1)
- **Metadata**: Statement IDs, dates, speakers, roles

Example:
```
statement_id | statement | speaker_name | gives_credit | credit_biden | credit_ira | ...
1001        | "Thanks to the IRA..." | John Smith | 1 | 0 | 1 | ...
1002        | "Our state secured..." | Jane Doe   | 1 | 0 | 0 | ...
```
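
Before downstream analysis, it can help to confirm that the annotation columns exist and hold only 0/1 values. The sketch below uses a hypothetical helper, `check_binary_columns()`, and a toy data frame with a deliberately invalid value.

```r
# Hypothetical helper: returns the names of columns containing values other than 0, 1, or NA
check_binary_columns <- function(df, vars) {
  missing_vars <- setdiff(vars, names(df))
  if (length(missing_vars) > 0) {
    stop("Missing annotation columns: ", paste(missing_vars, collapse = ", "))
  }
  ok <- vapply(df[vars], function(col) all(col %in% c(0, 1, NA)), logical(1))
  names(ok)[!ok]
}

# Toy data (illustration only); credit_ira contains an invalid code of 2
annotations <- data.frame(
  statement_id = c(1001, 1002),
  gives_credit = c(1, 1),
  credit_biden = c(0, 0),
  credit_ira   = c(1, 2)
)
bad_cols <- check_binary_columns(annotations, c("gives_credit", "credit_biden", "credit_ira"))
print(bad_cols)
```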

## Quality Control

### Validation Checks
- Date consistency validation (pre-IRA statements flagged)
- Explicit mention validation (credited entities must be mentioned in text)
- JSON parsing error handling
- Resume capability with integrity checks
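
The date consistency check amounts to comparing each statement date against the IRA date given earlier (2022-08-16). In the sketch below, the column name `statement_date` and the flag `pre_ira` are assumptions; the processed data has several date columns and the script may name them differently.

```r
ira_date <- as.Date("2022-08-16")

# Toy data (illustration only)
statements <- data.frame(
  statement_id   = c(1001, 1002, 1003),
  statement_date = as.Date(c("2022-07-30", "2022-08-16", "2023-03-01"))
)

# Flag statements dated strictly before the IRA date
statements$pre_ira <- statements$statement_date < ira_date
print(statements[statements$pre_ira, ])
```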

### Human-AI Agreement
- Quality control subset coded by multiple human annotators
- Agreement metrics calculated at variable and statement level
- Systematic disagreement patterns identified

### Caching Strategy
- **Intermediate results**: Saved every 10 iterations during annotation
- **Final results**: Saved to both `data/inter/` and date-stamped cache
- **Backward compatibility**: Can load old RDS format files
- **Resume logic**: Automatically detects and resumes from last valid checkpoint
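
The save-and-resume pattern described above can be sketched with `saveRDS()` and `readRDS()`. Everything here is illustrative: `annotate_with_checkpoints()` is hypothetical, and `annotate_one` stands in for the real API call. The actual script also performs integrity checks that this sketch omits.

```r
# Hypothetical annotation loop with periodic checkpoints
annotate_with_checkpoints <- function(ids, annotate_one, cache_file, save_every = 10) {
  # Load earlier results if a checkpoint exists
  results <- if (file.exists(cache_file)) readRDS(cache_file) else list()
  todo <- setdiff(as.character(ids), names(results))  # skip finished statements
  for (i in seq_along(todo)) {
    results[[todo[i]]] <- annotate_one(todo[i])
    # Save every `save_every` items and once more at the end
    if (i %% save_every == 0 || i == length(todo)) saveRDS(results, cache_file)
  }
  results
}

# Demo: count how often the (fake) annotator is called across two runs
calls <- 0
fake_annotator <- function(id) { calls <<- calls + 1; 1L }
cache <- tempfile(fileext = ".rds")

first  <- annotate_with_checkpoints(1:25, fake_annotator, cache)
second <- annotate_with_checkpoints(1:30, fake_annotator, cache)  # only 5 new calls
print(calls)
```

Keying results by statement ID rather than by position means the second run stays correct even if the input order changes.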

## Error Handling

- **API failures**: Exponential backoff retry logic
- **Rate limits**: Automatic delay and retry
- **JSON parsing**: Graceful handling of malformed responses
- **Missing data**: Proper NA handling throughout pipeline
- **File I/O**: Robust path checking and directory creation
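
Exponential backoff doubles the wait after each failed attempt. The wrapper below is a minimal sketch under that assumption; `with_backoff()`, its defaults, and the `sleep_fn` argument are hypothetical and not taken from `gpt_utils.R`.

```r
# Hypothetical retry wrapper: waits base_delay, then 2x, 4x, ... between attempts
with_backoff <- function(fn, max_tries = 5, base_delay = 1, sleep_fn = Sys.sleep) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt == max_tries) stop(result)        # give up and re-raise the last error
    sleep_fn(base_delay * 2^(attempt - 1))
  }
}

# Demo: a call that fails twice, then succeeds; delays are recorded instead of slept
attempts <- 0
flaky_call <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  "ok"
}
delays <- c()
out <- with_backoff(flaky_call, sleep_fn = function(s) delays <<- c(delays, s))
print(out)
print(delays)
```

Passing the sleep function as an argument keeps the demo instant; in a real run the default `Sys.sleep` applies.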

## Performance Notes

- **Processing time**: ~1-2 seconds per statement (due to API calls)
- **Batch size**: Processes statements sequentially with progress tracking
- **Memory usage**: Efficient streaming with periodic saves
- **Cost**: Uses cost-effective models (3.5-turbo for screening, 4o-mini for coding)

## Troubleshooting

### Common Issues
1. **API key not found**: Set `OPENAI_API_KEY` environment variable
2. **Rate limiting**: Increase `sleep_sec` parameter in annotation function
3. **Memory issues**: Lower `save_every` parameter for more frequent saves
4. **Interrupted runs**: Script will automatically resume from last checkpoint

### Debug Mode
For more verbose logging, see the utility functions in `code/utils/gpt_utils.R`.

---

**Replication:** Do not set `new_annotation <- TRUE` when reproducing results; use the pre-computed annotations in cache. The custom probe template in `annotate.R` is for robustness checks only.

*Last updated: January 2025*

*Author: Alexander F. Gazmararian (agazmararian@gmail.com)*
